skin-cancer.csv file available on Canvas.This week, a simple dataset will be explored with and without missing data and its effect on modelling.
Consider this older data set that shows the mortality rates (number of deaths per 10 million people) skin cancer for some US states against the latitude of the state. The data is available in skin-cancer.csv on Canvas.
full.data <- read.csv("skin-cancer.csv", header = TRUE)
Conduct a simple linear regression with the Mortality rates as the response and Latitude as the predictor. Plot the data along with the regression and interpret the regression output.
Solution
full.lm <- lm(Mort ~ Lat, data = full.data)
plot(Mort ~ Lat, data = full.data)
abline(full.lm)
summary(full.lm)
##
## Call:
## lm(formula = Mort ~ Lat, data = full.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.972 -13.185 0.972 12.006 43.938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 389.1894 23.8123 16.34 < 2e-16 ***
## Lat -5.9776 0.5984 -9.99 3.31e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.12 on 47 degrees of freedom
## Multiple R-squared: 0.6798, Adjusted R-squared: 0.673
## F-statistic: 99.8 on 1 and 47 DF, p-value: 3.309e-13
Simulate half the values in the Latitude feature to be missing at random values. (i.e. replace half the elements with missing NA values). Note the assignment of missing values to be is random but not completely at random. That is, the chance that Latitude is missing should depend on Mortality. A good convenient function for this purpose is available in mice::ampute. The return object of this function will contain the data with missingness in the element amp. E.g. if you assigned the output of the mice::ampute to an object called output, the data with missing values will be available in output$amp. We’ll call this the amputed data.
Solution
set.seed(5003)
library(mice)
available.data = ampute(full.data,
prop = 0.5,
patterns = data.frame(Mort = 1, Lat = 0))
Conduct a simple linear regression now on the amputed data but instead regressing Latitude on Mortality. That is, do a regression where Latitude is the response and Mortality is the predictor but using the amputed data where some Latitude cases are missing. (Note that R will automatically remove missing values from the model when using lm)
Solution
simple.missing <- lm(Lat ~ Mort, data = available.data$amp)
summary(simple.missing)
##
## Call:
## lm(formula = Lat ~ Mort, data = available.data$amp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8554 -1.6638 0.2069 1.6368 4.3635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.17688 3.09569 17.824 3.7e-14 ***
## Mort -0.10157 0.02132 -4.764 0.000105 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.543 on 21 degrees of freedom
## (26 observations deleted due to missingness)
## Multiple R-squared: 0.5194, Adjusted R-squared: 0.4965
## F-statistic: 22.69 on 1 and 21 DF, p-value: 0.0001049
Conduct a complete-case regression on the amputed data with Mortality as the response and Latitude as the predictor and show that it is not consistent with the original regression using the full data.
Solution
avaiable.lm <- lm(Mort ~ Lat, data = available.data$amp)
plot(Mort ~ Lat, data = full.data, col = "black")
points(Mort ~ Lat, data = available.data$amp, pch = 18, col = "blue")
abline(full.lm, lty = 'dotted', col = "black")
abline(avaiable.lm, col = "blue")
summary(avaiable.lm)
##
## Call:
## lm(formula = Mort ~ Lat, data = available.data$amp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.923 -12.402 -1.923 9.934 39.530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 350.893 43.794 8.012 8.03e-08 ***
## Lat -5.113 1.073 -4.764 0.000105 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.05 on 21 degrees of freedom
## (26 observations deleted due to missingness)
## Multiple R-squared: 0.5194, Adjusted R-squared: 0.4965
## F-statistic: 22.69 on 1 and 21 DF, p-value: 0.0001049
legend("bottomleft", legend = c("Full data", "Available data", "Full reg", "Available reg"), col = rep(c("black", "blue"), 2), pch = c(21, 18, rep(NA, 2)), lty = c(rep(NA, 2), c("dotted", "solid")))
The differences are slight but can see that the intercept and slope estimates are off with a smaller intercept on the model computed with missing data and the slope is also less steep at -5 instead of -5.9.
Randomly impute the missing Latitude observations. That is, predict the missing Latitudes using the output of section 1.3 using random imputation. That is, do the prediction from the linear model in 1.3 but randomly simluate some noise variables using the estimated variability from 1.3 too. Use these random imputed values to fill in the missing values in the amputed data. Call this new data.frame the imputed data. Then conduct the regression with Mortality as the response and Latitude as the predictor on the imputed dataset and compare your results with regression that used the the original complete data with no missingness. Compare the results of the imputed model to the full model without missingness.
Solution
missing.lats <- available.data$amp %>%
filter(is.na(Lat)) %>%
select(Mort)
simple.preds <- predict(simple.missing, newdata = missing.lats)
rand.preds <- simple.preds + rnorm(length(simple.preds), sd = sigma(simple.missing))
imputed.data <- available.data$amp
imputed.data[["Lat"]][is.na(imputed.data[["Lat"]])] <- rand.preds
imputed.lm <- lm(Mort ~ Lat, data = imputed.data)
full.range <- lapply(full.data, range)
imputed.range <- lapply(imputed.data, range)
yrange <- range(c(full.range$Mort, imputed.range$Mort))
xrange <- range(c(full.range$Lat, imputed.range$Lat))
plot(Mort ~ Lat, data = full.data, ylim = yrange, xlim = xrange)
abline(full.lm, lty = "dotted")
points(Mort ~ Lat, data = available.data$amp, col = "blue", pch = 21)
abline(avaiable.lm, col = "blue")
points(Mort ~ Lat, data = imputed.data, col = "red", pch = 18)
abline(imputed.lm, lty = "dashed", col = "red")
legend("bottomleft", legend = c("Full data", "Available data", "Imputed data", "Full reg", "Available reg", "Imputed reg"), col = rep(c("black", "blue", "red"), 2), pch = c(21, 21, 18, rep(NA, 3)), lty = c(rep(NA, 3), c("dotted", "solid", "dashed")))
summary(imputed.lm)
##
## Call:
## lm(formula = Mort ~ Lat, data = imputed.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.329 -13.807 -3.511 15.276 41.598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 379.1282 23.4159 16.191 < 2e-16 ***
## Lat -5.7085 0.5866 -9.731 7.66e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.46 on 47 degrees of freedom
## Multiple R-squared: 0.6683, Adjusted R-squared: 0.6612
## F-statistic: 94.69 on 1 and 47 DF, p-value: 7.662e-13
The data imputation has corrected the bias that was found in the regression estimates before and resulted in estimates that are much closer to the estimates on the full data as evidenced in the plot where the lines are very close.